
A typical machine learning project follows a general sequence of analysis stages for building a predictive model. Steps followed to perform the data analysis:
Question: can a model predict whether a project will be successful, failed, or canceled based on the given dataset? List of possible predicting factors:
Share of projects in each state (%):
- failed 52.22
- successful 35.38
- canceled 10.24
- undefined 0.94
- live 0.74
- suspended 0.49
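A breakdown like the one above can be computed with `value_counts(normalize=True)`. The following is a minimal sketch on a made-up toy sample, not the real dataset; the helper name `state_share` is hypothetical.

```python
import pandas as pd

# Hypothetical toy sample standing in for the dataset's `state` column.
states = pd.Series(
    ["failed", "failed", "successful", "canceled",
     "failed", "successful", "live", "failed"],
    name="state",
)


def state_share(series: pd.Series) -> pd.Series:
    """Return the percentage of rows per class, largest share first."""
    return (series.value_counts(normalize=True) * 100).round(2)


shares = state_share(states)
print(shares)
```

On the real data, the same call would be applied to the `state` column of the loaded DataFrame.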
Canceled state: About 10% of the projects in this dataset are in the canceled state. A project can be canceled for reasons unrelated to failure. For example, the project owner may have obtained funding elsewhere, or the project requirements may have changed, leading the owner to recreate the online crowdfunding campaign.
The dataset gives no clear reason why a project was canceled, nor the date on which it was canceled. The canceled state should therefore be treated as a separate state and not as failed.
Total projects: 378661. Total features: 15.
| | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000.0 | 2015-08-11 12:12:28 | 0.0 | failed | 0 | GB | 0.0 | 0.0 | 1533.95 |
| 1 | 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000.0 | 2017-09-02 04:43:57 | 2421.0 | failed | 15 | US | 100.0 | 2421.0 | 30000.00 |
| 2 | 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000.0 | 2013-01-12 00:20:50 | 220.0 | failed | 3 | US | 220.0 | 220.0 | 45000.00 |
| 3 | 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000.0 | 2012-03-17 03:24:11 | 1.0 | failed | 1 | US | 1.0 | 1.0 | 5000.00 |
| 4 | 1000011046 | Community Film Project: The Art of Neighborhoo... | Film & Video | Film & Video | USD | 2015-08-29 | 19500.0 | 2015-07-04 08:35:03 | 1283.0 | canceled | 14 | US | 1283.0 | 1283.0 | 19500.00 |
Note: The name column has 4 NaN values, whereas usd pledged has 3797 NaN values. These rows can be removed directly, as the dataset is large enough to perform the data analysis.
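One way to count and drop such rows is sketched below. The toy frame is made up for illustration; only the column names `name` and `usd pledged` come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names follow the dataset.
df = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "usd pledged": [10.0, 5.0, np.nan, 0.0],
    "state": ["failed", "successful", "failed", "canceled"],
})

# Count missing values per column before deciding what to drop.
missing = df.isna().sum()
print(missing)

# Drop only the rows where one of the two affected columns is missing.
clean = df.dropna(subset=["name", "usd pledged"]).reset_index(drop=True)
print(len(df), "->", len(clean))
```

Restricting `dropna` to a `subset` avoids discarding rows that are only missing values in columns not used by the analysis.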
Numeric variables such as backers, usd_pledged_real, and usd_goal_real are highly right-skewed, because so many failed projects have no backers or no pledged amount at all. This will be addressed through data normalization while developing a model.
To explore these variables, they first need to be transformed; histograms can then be created to visualize their distributions.
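A minimal sketch of such a transformation follows. It assumes a `log(1 + x)` transform via `np.log1p`, which stays defined for zero amounts; the source does not name the exact transform. The values are made up, and the `_log` suffix is used here only as a naming convention.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed goal amounts (not taken from the dataset).
df = pd.DataFrame({"usd_goal_real": [0.01, 100.0, 1000.0, 5000.0, 1_000_000.0]})

# Assumed transform: log1p(x) = ln(1 + x), which maps 0 to 0.
df["usd_goal_real_log"] = np.log1p(df["usd_goal_real"])

# Compare skewness before and after the transform.
raw_skew = df["usd_goal_real"].skew()
log_skew = df["usd_goal_real_log"].skew()
print(f"skew raw: {raw_skew:.3f}, skew log: {log_skew:.3f}")

# Bin counts for a histogram of the transformed column.
counts, edges = np.histogram(df["usd_goal_real_log"], bins=5)
print(counts)
```

To draw the histogram rather than just compute the bins, `df["usd_goal_real_log"].hist(bins=50)` can be used when matplotlib is installed.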
| Column | usd_goal_real_log | usd_pledged_real_log |
|---|---|---|
| skew | 12.765938 | 82.063085 |
| count | 369678.000000 | 369678.000000 |
| mean | 8.632460 | 5.775453 |
| std | 1.671539 | 3.309677 |
| min | 0.009950 | 0.000000 |
| 25% | 7.601402 | 3.526361 |
| 50% | 8.612685 | 6.456770 |
| 75% | 9.662097 | 8.314587 |
| max | 14.591996 | 16.828050 |
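The minimum goal amount is as small as 0.01. The log-scale minimum of 0.009950 shown for usd_goal_real_log equals ln(1 + 0.01), which is consistent with a log(1 + x) transform.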